Diamond Price Prediction¶

  • Dataset: Diamonds Prices
  • Dataset Source: https://www.kaggle.com/datasets/nancyalaswad90/diamonds-prices
  • Dataset Description: This dataset contains the prices and other attributes of almost 54,000 diamonds.
  • Dataset size: 53,943 rows (as reported by df.info() below), 10 feature columns plus an unnamed index column

Dataset Description¶

  • Number of Features: 10
  • Features: carat, cut, color, clarity, depth, table, price, x, y, z
  • Prices are in US dollars ($326–$18,823)
  • x (length), y (width), z (depth) are the dimensions of the diamonds in mm
  • clarity: a grade from I1 (worst) to IF (best), described in the sections below

Objectives of the project¶

  • Analyze the dataset
  • Clean the dataset
  • Predict the price of the diamonds
  • Classify the cut of the diamonds
  • Classify the color of the diamonds
  • Classify the clarity of the diamonds

Diamonds Information¶

Diamonds Carat¶

  • The carat is the unit of measurement for the weight of a diamond.
  • One carat is divided into 100 points (1 carat = 100 points).
  • Carat is usually the most important factor in determining the price of a diamond.
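
The carat arithmetic can be sketched in a few lines. The helper names are hypothetical, and the gram conversion relies on the metric carat being defined as 200 mg, which is not stated in the dataset itself.

```python
POINTS_PER_CARAT = 100
GRAMS_PER_CARAT = 0.2  # assumption: metric carat, defined as 200 mg


def carat_to_points(carat):
    """Convert a weight in carats to points (1 carat = 100 points)."""
    return round(carat * POINTS_PER_CARAT)


def carat_to_grams(carat):
    """Convert a weight in carats to grams."""
    return carat * GRAMS_PER_CARAT


# smallest and largest carat values listed in the feature description
print(carat_to_points(0.2), carat_to_grams(0.2))
print(carat_to_points(5.01), carat_to_grams(5.01))
```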

Diamond carat

Diamonds Cut¶

  • Diamonds are cut into different shapes, such as round, heart and oval, but this dataset only contains the prices and attributes of round-cut diamonds.
  • This means that from this dataset we can only build models that work with round diamonds.
  • In this dataset the cut is classified into 5 quality categories: Ideal, Premium, Very Good, Good and Fair.

Diamond round cut

Diamonds Color¶

  • Diamonds are classified into 7 color grades: (best) D, E, F, G, H, I and J (worst).
  • The grade is determined by the absence of color: the more colorless the diamond is, the higher the price.
  • So diamonds with color D are more colorless than diamonds with color J.

Diamond colors

Diamonds Clarity¶

  • Diamonds are classified into 8 clarity grades: (best) IF, VVS1, VVS2, VS1, VS2, SI1, SI2 and I1 (worst).
  • The clarity is determined by the presence of inclusions and blemishes. The more inclusions and blemishes the diamond has, the lower the price.
  • Blemishes are surface defects. Inclusions are internal defects.
  • So diamonds with clarity IF are clearer, with fewer defects, than diamonds with clarity I1.

Diamond clarity

Diamonds depth, table and carat¶

  • The depth is the height of the diamond, measured from the culet to the table, divided by its average girdle diameter and expressed as a percentage.
  • The table is the width of the diamond's top facet (the table), expressed as a percentage of its average diameter.

Diamond clarity

Features Description¶

  • carat: weight of the diamond (0.2–5.01)
  • cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
  • color: diamond color, from J (worst) to D (best)
  • clarity: how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
  • depth: total depth percentage = 100 * z / mean(x, y) = 100 * 2 * z / (x + y) (43–79)
  • table: width of top of diamond relative to widest point (43–95)
  • price: price in US dollars ($326 – $18,823)
  • x: length in mm (0–10.74)
  • y: width in mm (0–58.9)
  • z: depth in mm (0–31.8)
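
As a worked example of the depth formula, the sketch below applies it to the dimensions of the first row of the dataset (x = 3.95, y = 3.98, z = 2.43, recorded depth 61.5). The function name is hypothetical, and returning None for zero dimensions is an assumption about how to treat the rows where x, y or z is recorded as 0.

```python
def depth_percentage(x, y, z):
    """Total depth percentage: 100 * z / mean(x, y)."""
    if x + y == 0:
        return None  # dimensions recorded as 0 give no usable depth
    return 100 * 2 * z / (x + y)


# first row of the dataset; the result should land close to the recorded 61.5
print(round(depth_percentage(3.95, 3.98, 2.43), 2))
```

The small gap between the computed and recorded value is expected, because x, y and z are stored rounded to two decimals.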

Install Dependencies¶

In [ ]:
! pip3 install numpy
! pip3 install pandas
! pip3 install matplotlib
! pip3 install seaborn
! pip3 install scikit-learn
Requirement already satisfied: numpy in /Users/petercatania/Library/Python/3.9/lib/python/site-packages (1.23.0)
Requirement already satisfied: pandas in /Users/petercatania/Library/Python/3.9/lib/python/site-packages (2.0.0)
Requirement already satisfied: matplotlib in /Users/petercatania/Library/Python/3.9/lib/python/site-packages (3.7.1)
Requirement already satisfied: seaborn in /Users/petercatania/Library/Python/3.9/lib/python/site-packages (0.12.2)
Requirement already satisfied: scikit-learn in /Users/petercatania/Library/Python/3.9/lib/python/site-packages (1.2.2)

Imports¶

In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn

from pandas import Series, DataFrame
from pylab import rcParams
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics 
from sklearn.metrics import classification_report

import seaborn as sns 
sns.set_style('white')
sns.set_context('notebook')

# suppress warnings
import warnings
warnings.simplefilter('ignore')

Analyze and Clean the dataset¶

In [ ]:
%matplotlib inline

df = pd.read_csv("Diamonds Prices2022.csv")
df.sample(10)
Out[ ]:
Unnamed: 0 carat cut color clarity depth table price x y z
13139 13140 1.07 Premium F SI1 62.5 59.0 5436 6.57 6.52 4.09
42970 42971 0.41 Ideal E VVS1 60.9 55.0 1367 4.80 4.83 2.93
36285 36286 0.34 Good E SI2 63.7 55.0 477 4.43 4.46 2.83
101 102 0.75 Premium E SI1 59.9 54.0 2760 6.00 5.96 3.58
20422 20423 1.01 Ideal G IF 62.8 57.0 8778 6.42 6.39 4.02
30462 30463 0.31 Ideal D SI1 62.7 56.0 732 4.36 4.32 2.72
18825 18826 1.70 Very Good J SI1 59.1 61.0 7713 7.79 7.85 4.62
50956 50957 0.66 Very Good H VVS1 61.9 59.0 2323 5.59 5.65 3.48
30488 30489 0.31 Ideal D SI1 62.3 54.0 732 4.37 4.33 2.71
23203 23204 1.06 Ideal D VVS2 61.1 56.0 11209 6.58 6.59 4.02

First, check whether the dataset contains any NaN values.

In [ ]:
# count the numbers of NaN values in each column
df.isnull().sum()
Out[ ]:
Unnamed: 0    0
carat         0
cut           0
color         0
clarity       0
depth         0
table         0
price         0
x             0
y             0
z             0
dtype: int64

The output shows no null values. The first column (Unnamed: 0) is only a row counter with no relation to the other columns, so we can drop it.

In [ ]:
# drop first column (unnamed)
df = df.drop(df.columns[0], axis=1)
df.head()
Out[ ]:
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
In [ ]:
# check the data types of each column
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53943 entries, 0 to 53942
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   carat    53943 non-null  float64
 1   cut      53943 non-null  object 
 2   color    53943 non-null  object 
 3   clarity  53943 non-null  object 
 4   depth    53943 non-null  float64
 5   table    53943 non-null  float64
 6   price    53943 non-null  int64  
 7   x        53943 non-null  float64
 8   y        53943 non-null  float64
 9   z        53943 non-null  float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.1+ MB
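
The non-null counts do not rule out placeholder values: the feature description lists a minimum of 0 for x, y and z, which is not a physical dimension. Below is a minimal sketch of how such rows could be flagged. The small frame is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Made-up rows for illustration; the second row mimics a missing measurement
sample = pd.DataFrame({
    "carat": [0.23, 1.00, 0.31],
    "x": [3.95, 0.00, 4.34],
    "y": [3.98, 6.50, 4.35],
    "z": [2.43, 0.00, 2.75],
})

# A row is suspect if any of its three dimensions is zero
zero_dims = (sample[["x", "y", "z"]] == 0).any(axis=1)
print(zero_dims.sum())        # number of suspect rows

cleaned = sample[~zero_dims]  # keep only rows with plausible dimensions
print(len(cleaned))
```

In the notebook the same mask could be built from df instead of sample. Whether to drop or repair the flagged rows is a separate decision.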

Before viewing the correlation between the features, we need to convert the categorical features into numerical features, like this:

  • cut --> 5 (Ideal), 4 (Premium), 3 (Very Good), 2 (Good), 1 (Fair)
  • color --> 1 (D), 2 (E), 3 (F), 4 (G), 5 (H), 6 (I), 7 (J)
  • clarity --> 1 (IF), 2 (VVS1), 3 (VVS2), 4 (VS1), 5 (VS2), 6 (SI1), 7 (SI2), 8 (I1)

Note the direction: for color and clarity a lower value means higher quality, while for cut a higher value means higher quality. Higher quality should mean a higher price, but by how much?

In [ ]:
# convert categorical data to numerical data
cutMap = {'Ideal': 5, 'Premium': 4, 'Very Good': 3, 'Good': 2, 'Fair': 1}
colorMap = {'D': 1, 'E': 2, 'F': 3, 'G': 4, 'H': 5, 'I': 6, 'J': 7}
clarityMap = {'IF': 1, 'VVS1': 2, 'VVS2': 3, 'VS1': 4, 'VS2': 5, 'SI1': 6, 'SI2': 7, 'I1': 8}

df['cut'] = df['cut'].map(cutMap)
df['color'] = df['color'].map(colorMap)
df['clarity'] = df['clarity'].map(clarityMap)

df.head()
Out[ ]:
carat cut color clarity depth table price x y z
0 0.23 5 2 7 61.5 55.0 326 3.95 3.98 2.43
1 0.21 4 2 6 59.8 61.0 326 3.89 3.84 2.31
2 0.23 2 2 4 56.9 65.0 327 4.05 4.07 2.31
3 0.29 4 6 5 62.4 58.0 334 4.20 4.23 2.63
4 0.31 2 7 7 63.3 58.0 335 4.34 4.35 2.75
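
The mapping above encodes cut in the opposite direction from color and clarity, which makes the signs in a correlation matrix harder to read. One possible alternative, sketched here and not used in the rest of the notebook, is to encode all three so that a higher code always means a better grade. The helper name encode_ordinal is hypothetical; the sample rows are the first three rows of the dataset.

```python
import pandas as pd

# Grades listed from worst to best, as in the feature description
cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
color_order = ["J", "I", "H", "G", "F", "E", "D"]
clarity_order = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]


def encode_ordinal(values, order):
    """Map grades to 1..len(order); a higher code means a better grade."""
    mapping = {grade: rank for rank, grade in enumerate(order, start=1)}
    return values.map(mapping)


sample = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Good"],
    "color": ["E", "E", "E"],
    "clarity": ["SI2", "SI1", "VS1"],
})

encoded = pd.DataFrame({
    "cut": encode_ordinal(sample["cut"], cut_order),
    "color": encode_ordinal(sample["color"], color_order),
    "clarity": encode_ordinal(sample["clarity"], clarity_order),
})
print(encoded)
```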
In [ ]:
# heatmap that shows the correlation between the different features
sns.heatmap(df.corr())
Out[ ]:
<Axes: >
  • The table and depth columns appear correlated with each other, and with the other columns too. This suggests that we can drop one of the two columns.
  • A likely reason is that both are based on the same value: the width/diameter of the diamond.
  • This could also explain why the dimensions of the diamonds are somewhat correlated with the table and depth columns.
  • Next, I want to look at the distribution of the table and depth columns to better understand the relation between the two.
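
A heatmap without annotations only gives a rough impression, so it can help to print the correlation coefficient itself. The sketch below computes the Pearson correlation for the depth and table values of the five rows shown by df.head() above. Five rows are far too few to be representative, so this only illustrates the call.

```python
import numpy as np

# depth and table values of the first five rows of the dataset
depth = np.array([61.5, 59.8, 56.9, 62.4, 63.3])
table = np.array([55.0, 61.0, 65.0, 58.0, 58.0])

# np.corrcoef returns the 2x2 correlation matrix; [0, 1] is depth vs table
r = np.corrcoef(depth, table)[0, 1]
print(round(r, 2))
```

On the full frame, df['depth'].corr(df['table']) gives the corresponding value, and sns.heatmap(df.corr(), annot=True) writes the numbers into the plot.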
In [ ]:
# 2 side-by-side plots: price vs depth and price vs table
fig, (axis1,axis2) = plt.subplots(1,2,figsize=(15,5))

axis1.scatter(df['depth'], df['price'])
axis1.set_title('Depth vs Price')
axis1.set_xlabel('Depth')
axis1.set_ylabel('Price')

# the second panel uses the table column
axis2.scatter(df['table'], df['price'])
axis2.set_title('Table vs Price')
axis2.set_xlabel('Table')
Out[ ]:
Text(0.5, 0, 'Table')

Correlation between the table and depth columns¶

  • From the graphs, the table and depth columns appear to behave in almost the same way with respect to price, so we can drop one of the two columns.

Price Prediction¶

  • From depth or table alone we can't predict the price of the diamonds,
  • because the price is probably driven mostly by the carat and the "quality" (clarity, cut, color) of the diamonds.
  • So in the end we can drop the depth and table columns.
In [ ]:
#drop depth and table columns
dfCleaned = df.drop(['depth', 'table'], axis=1)
dfCleaned.head()
Out[ ]:
carat cut color clarity price x y z
0 0.23 5 2 7 326 3.95 3.98 2.43
1 0.21 4 2 6 326 3.89 3.84 2.31
2 0.23 2 2 4 327 4.05 4.07 2.31
3 0.29 4 6 5 334 4.20 4.23 2.63
4 0.31 2 7 7 335 4.34 4.35 2.75
In [ ]:
#custom legend for cut column
cutLegend = {1: 'Fair', 2: 'Good', 3: 'Very Good', 4: 'Premium', 5: 'Ideal'}

# plot carat vs price, with cut hue and custom cutLegend
sns.lmplot(x='carat', y='price', data=dfCleaned, hue='cut', fit_reg=False, legend=False)
plt.legend(cutLegend.values())
Out[ ]:
<matplotlib.legend.Legend at 0x290f90bb0>

Price Prediction¶

  • This tells us that carat alone already predicts the price of the diamonds fairly well.
  • But the relationship between carat and price is not linear, so we can try a polynomial regression model.
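
To illustrate why a polynomial model can help when the relationship is curved, the sketch below fits a straight line and a degree-2 polynomial to synthetic data. The quadratic price formula is an assumption chosen only for the illustration; it is not estimated from the dataset.

```python
import numpy as np

# Synthetic, noise-free data: price grows with the square of carat (assumption)
carat = np.linspace(0.2, 3.0, 50)
price = 1500.0 * carat**2 + 300.0

linear = np.polyfit(carat, price, deg=1)      # straight line
quadratic = np.polyfit(carat, price, deg=2)   # degree-2 polynomial


def mse(coeffs):
    """Mean squared error of a fitted polynomial on the synthetic data."""
    return float(np.mean((np.polyval(coeffs, carat) - price) ** 2))


print(mse(linear), mse(quadratic))
```

The line leaves a large systematic error, while the degree-2 fit recovers the curve. On real prices the gain would be smaller and would have to be checked on held-out data.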

Cut Influence on the Price¶

  • The cut influences the price of the diamonds, but it is not the only factor: the same price appears for different cut types,
  • which is expected because the price is also influenced by other factors such as carat, color and clarity.
In [ ]:
# put 3 scatter plot aside each other, for x and y and z in relation to the price
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 5))

ax1.scatter(dfCleaned['x'], dfCleaned['price'], picker=True)
ax1.set_title('x vs price')
ax1.set_xlabel('x')
ax1.set_ylabel('price')

ax2.scatter(dfCleaned['y'], dfCleaned['price'])
ax2.set_title('y vs price')
ax2.set_xlabel('y')

ax3.scatter(dfCleaned['z'], dfCleaned['price'])
ax3.set_title('z vs price')
ax3.set_xlabel('z')
Out[ ]:
Text(0.5, 0, 'z')

Dataset Column Adjustments¶

In [ ]:
# convert the categorical columns to binary indicator columns using get_dummies
dfAdjusted = pd.get_dummies(dfCleaned, columns=['cut', 'color', 'clarity'])
dfAdjusted.head()
Out[ ]:
carat price x y z cut_1 cut_2 cut_3 cut_4 cut_5 ... color_6 color_7 clarity_1 clarity_2 clarity_3 clarity_4 clarity_5 clarity_6 clarity_7 clarity_8
0 0.23 326 3.95 3.98 2.43 False False False False True ... False False False False False False False False True False
1 0.21 326 3.89 3.84 2.31 False False False True False ... False False False False False False False True False False
2 0.23 327 4.05 4.07 2.31 False True False False False ... False False False False False True False False False False
3 0.29 334 4.20 4.23 2.63 False False False True False ... True False False False False False True False False False
4 0.31 335 4.34 4.35 2.75 False True False False False ... False True False False False False False False True False

5 rows × 25 columns

Dataset Splitting¶

In [ ]:
# split the data into training and testing sets
X = dfAdjusted.drop('price', axis=1)
y = dfAdjusted['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Standardization¶

In [ ]:
# standardize the data
scaler = preprocessing.StandardScaler().fit(X_train)

# transform the data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# create dataframe from the scaled data
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)
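
What the scaler does can be sketched directly with NumPy: the mean and standard deviation are computed on the training rows only and then reused for the test rows. The small arrays hold the carat and x values of the first five rows of the dataset and are split by hand for illustration.

```python
import numpy as np

# carat and x of the first five rows; four for "training", one for "testing"
train = np.array([[0.23, 3.95], [0.21, 3.89], [0.29, 4.20], [0.31, 4.34]])
test = np.array([[0.23, 4.05]])

mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std   # test rows reuse the training statistics

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

Fitting the scaler on the training split only, as the cell above does, keeps information about the test rows out of the preprocessing.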

Regression¶

Polynomial Regression¶

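  • With degree 8 this takes too much time: the 24 input columns expand to millions of polynomial terms.
In [ ]:
# Polynomial Regression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# degree 2 keeps the number of generated terms manageable
poly = PolynomialFeatures(degree=2)
X_poly_train = poly.fit_transform(X_train_scaled)
X_poly_test = poly.transform(X_test_scaled)

# PolynomialFeatures only expands the features; a linear model does the fitting
poly_reg = LinearRegression()
poly_reg.fit(X_poly_train, y_train)

# Predicting a new result with Polynomial Regression
y_pred = poly_reg.predict(X_poly_test)

# print the R^2 score of the model (classification_report only applies to classifiers)
print(poly_reg.score(X_poly_test, y_test))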

Logistic Regression¶

  • This takes too much time: logistic regression is a classifier, so every distinct price is treated as a separate class.
In [ ]:
# Logistic Regression
logreg = LogisticRegression(solver='sag', max_iter=1)
logreg.fit(X_train_scaled, y_train)

# Predicting a new result with Logistic Regression
y_pred = logreg.predict(X_test_scaled)

#print accuracy of the model
print(classification_report(y_test, y_pred))

Ridge Regression¶

  • R² score: 0.92
  • Linear regression with L2 regularization
  • Useful when the dataset has many features (high-dimensional dataset) or correlated features
  • It shrinks the coefficients of columns that are not important for the prediction (without setting them exactly to zero)
  • Smaller, more stable coefficients are easier to interpret in relation to the target
  • The penalty is the regularization term; alpha is the coefficient that controls its strength
  • The penalty helps to avoid overfitting
  • The penalty keeps the model from becoming too complex, so it can generalize better
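
The effect of the penalty can be seen in the closed-form ridge solution, w = (XᵀX + αI)⁻¹Xᵀy. The sketch below is a simplified version for centered data without an intercept, on synthetic data with two nearly collinear columns. It is not how scikit-learn's solvers are implemented, and the function name is hypothetical.

```python
import numpy as np


def ridge_closed_form(X, y, alpha):
    """Solve min ||y - Xw||^2 + alpha * ||w||^2 (centered data, no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


# Synthetic data (illustrative only): two nearly collinear columns
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1 + 0.01 * rng.normal(size=200)])
X = X - X.mean(axis=0)
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)
y = y - y.mean()

# the norm of the coefficient vector shrinks as alpha grows
norms = []
for alpha in (0.0, 1.0, 100.0):
    w = ridge_closed_form(X, y, alpha)
    norms.append(float(np.linalg.norm(w)))
    print(alpha, np.round(w, 3), round(norms[-1], 3))
```

A larger alpha pulls the coefficients toward zero, which trades a little bias for lower variance.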

Ridge Regression with SAGA Solver¶

In [ ]:
# Ridge regression
from sklearn.linear_model import Ridge

# SAG is a stochastic optimization algorithm that is particularly useful for large-scale linear regression problems.
# SAGA is a variant of SAG that also supports the non-smooth penalty=l1
#ridgeReg = Ridge(alpha=0.05, solver='sag')
ridgeReg = Ridge(alpha=0.05, solver='saga')
ridgeReg.fit(X_train_scaled, y_train)

# Predicting a new result with Ridge Regression
y_pred = ridgeReg.predict(X_test_scaled)

# print the R^2 score of the model on the test set
print(ridgeReg.score(X_test_scaled, y_test))
0.922425670476217
  • The R² score on the test set is about 0.92, so this model explains most of the variance in price.
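
For a regressor, score returns the coefficient of determination R², not a classification accuracy. A minimal sketch of the definition, R² = 1 - SS_res / SS_tot, is shown below. The function name is hypothetical and the prices are the first four price values of the dataset.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


prices = [326, 327, 334, 335]
mean_price = sum(prices) / len(prices)

print(r2_score_manual(prices, prices))            # perfect predictions
print(r2_score_manual(prices, [mean_price] * 4))  # always predicting the mean
```

A score of 1 means perfect predictions, 0 means no better than always predicting the mean, and negative values are possible for worse models.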

Ridge regression with GridSearchCV¶

In [ ]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

# parameters that we want to tune
alpha = [1e-15, 1e-10, 1e-8, 1e-4, 1e-3,1e-2, 1, 5, 10, 20, 30, 50, 70, 100, 150, 200, 300, 500, 700, 1000, 1500, 2000]

ridge = Ridge()

parameters = {'alpha': alpha}

# GridSearchCV will try all the combinations of the parameters
ridge_regressor = GridSearchCV(ridge, parameters,scoring='neg_mean_squared_error', cv=5)

ridge_regressor.fit(X_train_scaled, y_train)

print(ridge_regressor.best_params_)
print(-ridge_regressor.best_score_)
{'alpha': 20}
1295574.3904235393
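
The second printed value is the cross-validated mean squared error, which is in squared dollars and hard to read. Taking the square root gives an error in dollars that can be compared with the price range ($326–$18,823).

```python
import math

best_mse = 1295574.3904235393  # value printed by the grid search above
rmse = math.sqrt(best_mse)     # root mean squared error, in US dollars
print(round(rmse, 2))
```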
In [ ]:
ridge = Ridge(alpha=ridge_regressor.best_params_['alpha'])
ridge.fit(X_train_scaled, y_train)

# Predicting a new result with Ridge Regression
y_pred = ridge.predict(X_test_scaled)

# print the R^2 score of the model on the test set
print(ridge.score(X_test_scaled, y_test))
0.922375573729045
  • The tuned alpha gives practically the same R² score (about 0.92) as the model above.

Lasso Regression¶

  • R² score: 0.92
In [ ]:
#Lasso regression
from sklearn.linear_model import Lasso

lassoReg = Lasso(alpha=0.6)
lassoReg.fit(X_train_scaled, y_train)

# Predicting a new result with Lasso Regression
y_pred = lassoReg.predict(X_test_scaled)

#print accuracy of the model
print(lassoReg.score(X_test_scaled, y_test))
0.9223987609445108
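
Unlike ridge, the L1 penalty used by lasso can set coefficients exactly to zero, which makes it act as a feature selector. The sketch below shows this on a tiny hand-built design matrix with two orthogonal columns, where only the first one influences the target. The data is constructed for illustration only.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Two centered, mutually orthogonal columns; only the first drives the target
x0 = np.tile([1.0, -1.0], 4)
x1 = np.tile([1.0, 1.0, -1.0, -1.0], 2)
X_demo = np.column_stack([x0, x1])
y_demo = 3.0 * x0

lasso_demo = Lasso(alpha=0.5)
lasso_demo.fit(X_demo, y_demo)

# the useful coefficient is shrunk, the irrelevant one is exactly zero
print(lasso_demo.coef_)
```

Inspecting lassoReg.coef_ from the cell above in the same way would show which of the diamond features, if any, were dropped at alpha=0.6.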

Lasso Regression with only x, y, z¶

In [ ]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso

# modify train and test data, with only x,y,z
X_train = X_train_scaled[['x', 'y', 'z']]
X_test = X_test_scaled[['x', 'y', 'z']]

# parameters that we want to tune
alpha = [1e-15, 1e-10, 1e-8, 1e-4, 1e-3,1e-2, 0.6, 1, 5, 10, 20, 30, 50, 70, 100, 150, 200, 300, 500, 700, 1000, 1500, 2000]
In [ ]:
lassoReg = Lasso(alpha=0.6)
lassoReg.fit(X_train, y_train)

# Predicting a new result with Lasso Regression
y_pred = lassoReg.predict(X_test)

#print accuracy of the model
print(lassoReg.score(X_test, y_test))
0.7804924706294081
In [ ]:
lasso = Lasso()

parameters = {'alpha': alpha}

# GridSearchCV will try all the combinations of the parameters
lasso_regressor = GridSearchCV(lasso, parameters,scoring='neg_mean_squared_error', cv=5)

lasso_regressor.fit(X_train, y_train)

print(lasso_regressor.best_params_)
print(-lasso_regressor.best_score_)

lasso = Lasso(alpha=lasso_regressor.best_params_['alpha'])
lasso.fit(X_train, y_train)

# Predicting a new result with Lasso Regression
y_pred = lasso.predict(X_test)

#print accuracy of the model
print(lasso.score(X_test, y_test))
{'alpha': 50}
3462425.630456732
0.7803486557847755
  • With lasso and only x, y and z it is still possible to obtain a fair score (R² about 0.78).
  • Is it also possible with only x, or only carat?

SGD Regression¶

  • R² score: 0.92
  • SGD = Stochastic Gradient Descent
  • simple and efficient optimization
  • suited to large datasets, with many samples or many features
  • used for training large-scale machine learning models
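
The idea behind SGD can be sketched for a one-feature linear model: the weights are updated after every single sample instead of after a full pass over the data. This is a bare-bones illustration on synthetic data, without the regularization and learning-rate schedule that SGDRegressor applies by default.

```python
import numpy as np

# Synthetic, noise-free data: y = 2x + 1
x = np.linspace(-1.0, 1.0, 50)
y = 2.0 * x + 1.0

w, b = 0.0, 0.0
learning_rate = 0.1
rng = np.random.default_rng(0)

for epoch in range(200):
    for i in rng.permutation(len(x)):   # visit the samples in random order
        error = (w * x[i] + b) - y[i]   # prediction error for one sample
        w -= learning_rate * error * x[i]   # gradient step for the slope
        b -= learning_rate * error          # gradient step for the intercept

print(w, b)
```

Because each update touches only one sample, the cost per step does not grow with the size of the dataset.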
In [ ]:
#importing the SGD Regressor
from sklearn.linear_model import SGDRegressor

# Fitting SGD Regressor to the Training set
sgd_reg = SGDRegressor()
sgd_reg.fit(X_train_scaled, y_train)

# Predicting a new result with SGD Regressor
y_pred = sgd_reg.predict(X_test_scaled)

# R^2 score of the SGD Regressor on the test set
print(sgd_reg.score(X_test_scaled, y_test))
0.9222990712101318
  • The R² score of this model is high (about 0.92).

SGD Regression with only x¶

  • R² score: 0.78
In [ ]:
# modify train and test data, with only x
X_train = X_train_scaled[['x']]
X_test = X_test_scaled[['x']]
In [ ]:
# Fitting SGD Regressor to the Training set
sgd_reg = SGDRegressor()
sgd_reg.fit(X_train, y_train)

# Predicting a new result with SGD Regressor
y_pred = sgd_reg.predict(X_test)

#accuracy of the model SGD Regressor
print(sgd_reg.score(X_test, y_test))
0.780279120075163

SGD Regression with only Carat¶

In [ ]:
# modify train and test data, with only carat
X_train = X_train_scaled[['carat']]
X_test = X_test_scaled[['carat']]
In [ ]:
# Fitting SGD Regressor to the Training set
sgd_reg = SGDRegressor()
sgd_reg.fit(X_train, y_train)

# Predicting a new result with SGD Regressor
y_pred = sgd_reg.predict(X_test)

#accuracy of the model SGD Regressor
print(sgd_reg.score(X_test, y_test))
0.7805163504962933

We can see that the score obtained by training only with x or only with carat is fair (R² about 0.78), but obviously this is not the best model.

Decision Tree Regression¶

  • R² score: 0.96
In [ ]:
# Decision Tree
# A decision tree is a non-parametric supervised learning method used for classification and regression.
from sklearn.tree import DecisionTreeRegressor

dtree = DecisionTreeRegressor()
dtree.fit(X_train_scaled, y_train)

# Predicting a new result with Decision Tree
y_pred = dtree.predict(X_test_scaled)

#print accuracy of the model
print(dtree.score(X_test_scaled, y_test))
0.9649467463950754
  • The R² score of this model is very high (about 0.96).

Random Forest Regression¶

  • R² score: 0.98
In [ ]:
# Random Forest Regression
from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor(n_estimators=100, random_state=0)
rfr.fit(X_train_scaled, y_train)

# Predicting a new result with Random Forest Regression
y_pred = rfr.predict(X_test_scaled)

#print accuracy of the model
print(rfr.score(X_test_scaled, y_test))
0.9784884722052908
  • The R² score of this model is very high (about 0.98), the best among the regression models tried here.

Support Vector Regression¶

  • R² score: 0.36
In [ ]:
# Support Vector Machine model
from sklearn.svm import SVR

svr_model = SVR()
svr_model.fit(X_train_scaled, y_train)

# Predicting a new result with Support Vector Machine
y_pred = svr_model.predict(X_test_scaled)

#print accuracy of the model
print(svr_model.score(X_test_scaled, y_test))
  • With the default parameters, the score of this regression model is very low.

Price Prediction¶

  • From the plots above we can see that the price is somewhat correlated with the carat, x, y and z features,
  • so we can try to predict the price of the diamonds using only these features or, even better, using these features together with the others.

Classification¶

Prepare Data¶

In [ ]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Splitting the dataset into the Training set and Test set
X = dfCleaned.drop('color', axis=1)
y = dfCleaned['color']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Feature Scaling
sc = StandardScaler()

# transform the data
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)

# create dataframe from the scaled data
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)  
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)

Color Classification¶

KNN Classification¶

  • Accuracy: 0.43
In [ ]:
# KNN model for classification of the color
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

# Predicting a new result with KNN
y_pred = knn.predict(X_test_scaled)

#print accuracy of the model
print('KNN Classification Accuracy: ',knn.score(X_test_scaled, y_test))
KNN Classification Accuracy:  0.4326628973954954

Best KNN Classification¶

  • Accuracy: 0.45
In [ ]:
error = []

# KNeighborsClassifier works well with a small number of features (up to 4 or 5 works really well)
# Calculating error for K values between 1 and 59, on the scaled features used by the other KNN models
for i in range(1, 60):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train_scaled, y_train)
    pred_i = knn.predict(X_test_scaled)
    error.append(np.mean(pred_i != y_test)) # calculate mean error
In [ ]:
plt.figure(figsize=(12, 6))
plt.plot(range(1, 60), error, color='red', linestyle='dashed', marker='o', markerfacecolor='blue', markersize=10)
plt.title('Error Rate K Value')
plt.xlabel('K Value')
plt.ylabel('Mean Error')
Out[ ]:
Text(0, 0.5, 'Mean Error')
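
The loop above picks K by its error on the test set, which tends to give an optimistic estimate. A common alternative is to choose K by cross-validation on the training data only. The sketch below shows the pattern with a scaling pipeline and GridSearchCV. It uses the iris dataset as stand-in data so that it runs on its own; the range of K values is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (iris), used only so that the sketch is self-contained
X_demo, y_demo = load_iris(return_X_y=True)

# scaling inside the pipeline is refit on each training fold
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 31))}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
print(best_k, round(search.best_score_, 3))
```

In the notebook, X_demo and y_demo would be replaced by the unscaled X_train and y_train, and the test set would be used once at the end to report the final accuracy.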
In [ ]:
# KNN model for classification of the color
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train_scaled, y_train)

# Predicting a new result with KNN
y_pred = knn.predict(X_test_scaled)

#print accuracy of the model
print('KNN Classification Accuracy: ',knn.score(X_test_scaled, y_test))
KNN Classification Accuracy:  0.44758550375382333

SVM Classification¶

  • Accuracy: 0.38
In [ ]:
# SVM model for classification of the color
from sklearn.svm import SVC

svc_model = SVC()
svc_model.fit(X_train_scaled, y_train)

# Predicting a new result with SVM
y_pred = svc_model.predict(X_test_scaled)

#print accuracy of the model
print('SVM Classification Accuracy: ',svc_model.score(X_test_scaled, y_test))
SVM Classification Accuracy:  0.3821484845676152
In [ ]:
# Support Vector Machine with explicit C and gamma values
from sklearn.svm import SVC

svc_model = SVC(C=1, gamma=0.1)
svc_model.fit(X_train_scaled, y_train)

# Predicting a new result with SVM
y_pred = svc_model.predict(X_test_scaled)

#print accuracy of the model
print('SVM Classification Accuracy: ',svc_model.score(X_test_scaled, y_test))
SVM Classification Accuracy:  0.3745481508944295

Decision Tree Classification¶

  • Accuracy: 0.51
In [ ]:
# Decision Tree
# A decision tree is a non-parametric supervised learning method used for classification and regression.
from sklearn.tree import DecisionTreeClassifier

dtree = DecisionTreeClassifier()
dtree.fit(X_train_scaled, y_train)

# Predicting a new result with Decision Tree
y_pred = dtree.predict(X_test_scaled)

#print accuracy of the model
print('Decision Tree Classification Accuracy: ',dtree.score(X_test_scaled, y_test))
Decision Tree Classification Accuracy:  0.5130225229400315

Random Forest Classification¶

  • Accuracy: 0.56 (the cell output recorded below shows 0.18, so the two values come from different runs)
In [ ]:
# Random Forest
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train_scaled, y_train)

# Predicting a new result with Random Forest
y_pred = rfc.predict(X_test_scaled)

#print accuracy of the model
print('Random Forest Classification Accuracy: ',rfc.score(X_test_scaled, y_test))
Random Forest Classification Accuracy:  0.17582723143942905
In [ ]:
# importance of the features in the Random Forest model
feature_imp = pd.Series(rfc.feature_importances_, index=X.columns).sort_values(ascending=False)

# Creating a bar plot
sns.barplot(x=feature_imp, y=feature_imp.index)
Out[ ]:
<Axes: >
In [ ]:
X = X_train_scaled.drop(['carat','clarity','cut'], axis=1)
X_test = X_test_scaled.drop(['carat','clarity','cut'], axis=1)

rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X, y_train)

# Predicting a new result with Random Forest
y_pred = rfc.predict(X_test)

#print accuracy of the model
print('Random Forest Classification Accuracy: ',rfc.score(X_test, y_test))
Random Forest Classification Accuracy:  0.1738808045231254
  • The accuracy does not improve after dropping carat, clarity and cut, so all features appear to matter for predicting the color of the diamonds.
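
With seven color classes, accuracies are easier to judge against a simple baseline such as always predicting the most frequent class. A minimal sketch of that baseline follows. The function name is hypothetical and the label lists are made up for illustration.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return hits / len(test_labels)


# Made-up color codes (1 = D ... 7 = J), for illustration only
train_colors = [4, 4, 2, 3, 4, 1, 5, 2]
test_colors = [4, 2, 4, 7]
print(majority_baseline_accuracy(train_colors, test_colors))  # 0.5
```

In the notebook, passing y_train and y_test gives the reference value: a classifier that scores below it has learned less than the class frequencies alone.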
In [ ]:
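# add volume of the diamond, approximated as an ellipsoid: V = (4/3) x π x (r1 x r2 x r3),
# where the radii are half of the x, y and z dimensions
dfCleaned['volume'] = (4/3) * np.pi * (dfCleaned['x'] / 2) * (dfCleaned['y'] / 2) * (dfCleaned['z'] / 2)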

# add density of the diamond
dfCleaned['density'] = dfCleaned['carat'] / dfCleaned['volume']

# add price per carat
dfCleaned['price_per_carat'] = dfCleaned['price'] / dfCleaned['carat']

# add price per volume
dfCleaned['price_per_volume'] = dfCleaned['price'] / dfCleaned['volume']

# Splitting the dataset into the Training set and Test set
X = dfCleaned.drop('color', axis=1)
y = dfCleaned['color']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Feature Scaling
sc = StandardScaler()

# Drop the new columns before scaling
X_train = X_train.drop(['density', 'price_per_volume'], axis=1)
X_test = X_test.drop(['density', 'price_per_volume'], axis=1)

# transform the data
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)

# create dataframe from the scaled data
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)
In [ ]:
# tree-based models do not depend on feature scaling, so the unscaled data is used here
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)

# Predict the test set labels with the Random Forest
y_pred = rfc.predict(X_test)

# print the accuracy of the model
print('Random Forest Classification Accuracy: ', rfc.score(X_test, y_test))

# importance of the features in the Random Forest model
feature_imp = pd.Series(rfc.feature_importances_, index=X_train.columns).sort_values(ascending=False)

# Creating a bar plot
sns.barplot(x=feature_imp, y=feature_imp.index)
Random Forest Classification Accuracy:  0.6272129020298453
Out[ ]:
<Axes: >
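
The volume feature treats each diamond as an ellipsoid. As a small worked example, and assuming that x, y and z are full lengths in mm (so the semi-axes are half of each), the calculation can be sketched with the standard library alone. The function names and the example stone are hypothetical.

```python
import math

def ellipsoid_volume_mm3(x, y, z):
    """Volume of an ellipsoid whose full axis lengths are x, y, z (in mm)."""
    return (4 / 3) * math.pi * (x / 2) * (y / 2) * (z / 2)

def density_carat_per_mm3(carat, x, y, z):
    """Carat weight divided by the approximate ellipsoid volume."""
    return carat / ellipsoid_volume_mm3(x, y, z)

# Hypothetical stone: 1 carat, 6.5 mm x 6.5 mm x 4.0 mm
volume = ellipsoid_volume_mm3(6.5, 6.5, 4.0)
density = density_carat_per_mm3(1.0, 6.5, 6.5, 4.0)
print(round(volume, 2), "mm^3")
print(round(density, 4), "carat per mm^3")
```

A cut diamond is not a true ellipsoid, so this is only a rough proxy; for tree-based models a constant scaling factor in the volume does not change the splits.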

Clustering¶

Correlation between the features in relation to color, clarity and cut¶

In [ ]:
sns.pairplot(dfCleaned, hue='color', palette='coolwarm')
Out[ ]:
<seaborn.axisgrid.PairGrid at 0x17df6ec40>
In [ ]:
sns.pairplot(dfCleaned, hue='clarity', palette='coolwarm')
Out[ ]:
<seaborn.axisgrid.PairGrid at 0x17df5a970>
In [ ]:
sns.pairplot(dfCleaned, hue='cut', palette='coolwarm')
Out[ ]:
<seaborn.axisgrid.PairGrid at 0x29f2cfa60>
  • Consider the features color, clarity and cut
  • In the pair plots, the data for the feature clarity appears more concentrated in spots
  • So we can try to use a clustering algorithm to predict the clarity of the diamonds

K-Means Clustering¶

  • K-Means produces round clusters
  • It does not work well if the data is not organized in round clusters
  • So it is not a good fit for this dataset
In [ ]:
# Clustering with K-Means
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3)
kmeans.fit(dfCleaned.drop('clarity', axis=1))

# view the cluster centroids
print(kmeans.cluster_centers_)

# view the labels
print(kmeans.labels_)

# add a new column to the dataframe with the cluster labels
dfCleaned['cluster'] = kmeans.labels_

# plot the clusters
sns.lmplot(x='x', y='carat', data=dfCleaned, hue='cluster', palette='coolwarm', aspect=1, fit_reg=False)
[[4.92848617e-01 3.98960889e+00 3.35044988e+00 1.46154640e+03
  4.99386077e+00 5.00082462e+00 3.08522615e+00 9.08843043e-04]
 [1.11737698e+00 3.72306084e+00 3.91341567e+00 5.79390642e+03
  6.62044893e+00 6.61670130e+00 4.08931059e+00 1.99954257e+00]
 [1.71822589e+00 3.89469011e+00 4.15485704e+00 1.33608863e+04
  7.63643758e+00 7.63806784e+00 4.70089505e+00 1.00000000e+00]]
[0 0 0 ... 0 0 0]
Out[ ]:
<seaborn.axisgrid.FacetGrid at 0x292802e50>
In [ ]:
kmeans = KMeans(n_clusters=8)
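# exclude the labels of the previous run so they are not used as a feature
kmeans.fit(dfCleaned.drop(['clarity', 'cluster'], axis=1, errors='ignore'))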

# add a new column to the dataframe with the cluster labels
dfCleaned['cluster'] = kmeans.labels_

# side by side plots of the clusters and the original data
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(15,6))
ax1.set_title('KMeans')
sns.scatterplot(x='x', y='carat', data=dfCleaned, hue='cluster', palette='coolwarm', ax=ax1, legend=True)

ax2.set_title("Original")
sns.scatterplot(x='x', y='carat', data=dfCleaned, hue='clarity', palette='coolwarm', ax=ax2, legend=True)
Out[ ]:
<Axes: title={'center': 'Original'}, xlabel='x', ylabel='carat'>
  • The plots show that the groups in the data are not round, so the K-Means algorithm is not a good fit for this dataset
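
To illustrate why the shape of the groups matters, here is a minimal sketch on synthetic data (an assumption, not the diamonds dataset). It runs K-Means on round, well-separated blobs and on two interleaved half-moons, and scores each result against the true groups with the adjusted Rand index.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_moons
from sklearn.metrics import adjusted_rand_score

# Round, well-separated groups: the case K-Means handles well
X_blobs, y_blobs = make_blobs(
    n_samples=300, centers=[(-5, -5), (0, 0), (5, 5)],
    cluster_std=0.5, random_state=0,
)
# Two interleaved half-moons: groups that are not round
X_moons, y_moons = make_moons(n_samples=300, noise=0.05, random_state=0)

labels_blobs = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_blobs)
labels_moons = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_moons)

# Adjusted Rand index: 1.0 means the clusters match the true groups exactly
ari_blobs = adjusted_rand_score(y_blobs, labels_blobs)
ari_moons = adjusted_rand_score(y_moons, labels_moons)
print("ARI on blobs:", ari_blobs)
print("ARI on moons:", ari_moons)
```

The same score could be computed between `dfCleaned['cluster']` and `dfCleaned['clarity']` to put a number on the visual comparison in the side by side plots.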

Agglomerative Clustering¶

In [ ]:
# Clustering models: K-Means, Hierarchical Clustering, DBSCAN, Gaussian Mixture Model (GMM), Mean Shift, Spectral Clustering, Affinity Propagation,
# Agglomerative Clustering, Birch, Mini-Batch K-Means, OPTICS, and more.
# Clustering with Hierarchical (Agglomerative) Clustering
from sklearn.cluster import AgglomerativeClustering

agg = AgglomerativeClustering(n_clusters=8)
# exclude the K-Means labels so they are not used as a feature
agg.fit(dfCleaned.drop(['clarity', 'cluster'], axis=1, errors='ignore'))

# view the labels
print(agg.labels_)
# add a new column to the dataframe with the cluster labels
dfCleaned['cluster'] = agg.labels_
[5 5 5 ... 0 0 0]
In [ ]:
# side by side plots of the clusters and the original data
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(15,6))
ax1.set_title('AgglomerativeClustering')
sns.scatterplot(x='price', y='carat', data=dfCleaned, hue='cluster', palette='coolwarm', ax=ax1)

ax2.set_title("Original")
sns.scatterplot(x='price', y='carat', data=dfCleaned, hue='clarity', palette='coolwarm', ax=ax2)
Out[ ]:
<Axes: title={'center': 'Original'}, xlabel='price', ylabel='carat'>

DBSCAN Clustering¶

In [ ]:
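# side by side plots of the clusters and the original data
# NOTE: no DBSCAN model is fitted in this cell, so the 'cluster' column
# still holds the labels of the last clustering step that was run
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(15,6))
ax1.set_title('DBSCAN Clustering')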
sns.scatterplot(x='price', y='carat', data=dfCleaned, hue='cluster', palette='coolwarm', ax=ax1)

ax2.set_title("Original")
sns.scatterplot(x='price', y='carat', data=dfCleaned, hue='clarity', palette='coolwarm', ax=ax2)
Out[ ]:
<Axes: title={'center': 'Original'}, xlabel='price', ylabel='carat'>
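
The plot above reuses the existing cluster column, and this section shows no DBSCAN fit. As a hedged sketch of how such a fit could look, the example below runs scikit-learn's `DBSCAN` on hypothetical synthetic data (two concentric rings). The values of `eps` and `min_samples` are illustrative assumptions for this toy data and would need tuning, after feature scaling, for the diamonds data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

# Hypothetical data: two concentric rings, which are not round clusters
X_rings, y_rings = make_circles(
    n_samples=500, factor=0.3, noise=0.05, random_state=0
)

# eps and min_samples are illustrative values chosen for this synthetic data
dbscan = DBSCAN(eps=0.2, min_samples=5)
labels = dbscan.fit_predict(X_rings)

# DBSCAN marks points that belong to no cluster with the label -1
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
ari = adjusted_rand_score(y_rings, labels)
print("clusters:", n_clusters, "noise points:", n_noise, "ARI:", ari)
```

Unlike K-Means, DBSCAN does not take the number of clusters as input; it groups points by density, which is why it can follow shapes that are not round.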
In [ ]:
import umap
In [ ]:
# create a UMAP model with 2 dimensions
# n_neighbors default is 15
umapModel = umap.UMAP(n_components=2, n_neighbors=5, random_state=42, min_dist=0.1)

# fit the model to the data
manifold = umapModel.fit(dfCleaned.drop('clarity', axis=1))
In [ ]:
#! pip install pandas matplotlib datashader bokeh holoviews colorcet scikit-image
#! pip install umap-learn[plot]
import umap.plot

y = dfCleaned['clarity'].values.flatten()

# plot the UMAP embedding, colored by the clarity labels
umap.plot.points(manifold, labels=y, theme="fire", width=1500, height=1000);

# Error seen when plotting: "Image size of 636x251086 pixels is too large. It must be less than 2^16 in each direction."
In [ ]:
# plot different UMAP models with different parameters
fig, ax_array = plt.subplots(4, 4, figsize=(15, 15))
for a, n in enumerate([5, 10, 15, 20]):
    for b, d in enumerate([0.1, 0.25, 0.5, 0.75]):
        umapModel = umap.UMAP(n_components=2, n_neighbors=n, random_state=42, min_dist=d)
        manifold = umapModel.fit(dfCleaned.drop('clarity', axis=1))
        umap.plot.points(manifold, labels=y, theme="fire", width=1500, height=1000, ax=ax_array[a, b])
        ax_array[a, b].set_title(f"n_neighbors={n}, min_dist={d}")

# plot the last UMAP model of the loop (n_neighbors=20, min_dist=0.75), colored by the clarity labels
umap.plot.points(manifold, labels=y, theme="fire", width=1500, height=1000);
/Users/petercatania/Library/Python/3.9/lib/python/site-packages/sklearn/manifold/_spectral_embedding.py:274: UserWarning: Graph is not fully connected, spectral embedding may not work as expected.
  warnings.warn(